Pascal Held, Otto von Guericke University Magdeburg, Germany, pascal.held@ovgu.de PRIMARY
Christian Braune, Otto von Guericke University Magdeburg, Germany, christian.braune@ovgu.de
Rudolf Kruse, Otto von Guericke University Magdeburg, Germany, rudolf.kruse@ovgu.de
Student Team: NO
Did you use data from both mini-challenges? YES
Self-developed scripts for analysis and visualization.
Python
Matplotlib
NumPy / SciPy
Approximately how many hours were spent working on this submission in total?
60h
May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2015 is complete? YES
Video:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Questions
MC2.1 – Identify those IDs that stand out for their large volumes of communication. For each of these IDs
a. Characterize the communication patterns you see.
b. Based on these patterns, what do you hypothesize about these IDs?
Limit your response to no more than 4 images and 300 words.
There are several guests with a lot of communication, but two of them have extremely high communication frequencies. The user with the ID 1278894 sent 190360 messages to 2521 other guests. The user with the ID 839736 sent 60812 messages to 8720 other guests. The next most active guests sent about 3000-4000 messages to 500 to 600 peers. Because of this large gap, we focus on the two high-frequency users.
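Counts like these can be derived with a single pass over the message log. The following is a minimal sketch, not the script used for this submission; the `(timestamp, sender, recipient)` record layout, the function names, and the toy data are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def sender_volumes(messages):
    """Return {sender: (messages_sent, distinct_recipients)}.

    `messages` is an iterable of (timestamp, sender, recipient) tuples;
    this record layout is an assumption, not the challenge file format.
    """
    sent = Counter()
    peers = defaultdict(set)
    for _timestamp, sender, recipient in messages:
        sent[sender] += 1
        peers[sender].add(recipient)
    return {s: (sent[s], len(peers[s])) for s in sent}

def top_senders(messages, n=5):
    """Senders ordered by number of messages sent, largest first."""
    volumes = sender_volumes(messages)
    return sorted(volumes.items(), key=lambda kv: kv[1][0], reverse=True)[:n]

# Toy data, invented for illustration only (timestamps in seconds).
toy = [
    (0, "A", "B"),
    (0, "A", "C"),
    (300, "A", "B"),
    (370, "B", "A"),
]
print(top_senders(toy))
```

Sorting by the message count and inspecting the gap between the first few entries and the rest is one simple way to decide which IDs to treat as outliers.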
As shown in the figures above, the user 1278894 sends messages very regularly. A closer look at the data shows that, between 12 and 21 o’clock, the messages are sent at intervals of exactly 5 minutes for one hour, followed by a one-hour pause. All messages are sent to almost the same guests during the three days. Around 15 o’clock the number of receivers decreases slightly because guests are leaving. The same thing happens in the evening hours. We assume that the ID 1278894 is something like a park information system. This system is like a newsletter that announces shows and events in the park. Guests have to subscribe to this service, so not every guest receives it.
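One way to check such regularity is to look at the gaps between consecutive distinct send times of a single ID. The sketch below assumes timestamps given as integer seconds and uses an invented toy schedule; it is an illustration, not the analysis code behind the figures.

```python
from collections import Counter

def send_gaps(timestamps):
    """Count the gaps (in seconds) between consecutive distinct send times.

    A broadcast produces many messages with the same timestamp, so
    duplicates are collapsed before the gaps are computed.
    """
    times = sorted(set(timestamps))
    return Counter(b - a for a, b in zip(times, times[1:]))

# Toy schedule (assumed): every 300 s for one hour, one hour of silence,
# then another hour of sending.
toy_times = []
for block_start in (0, 7200):
    toy_times.extend(block_start + 300 * k for k in range(12))

gaps = send_gaps(toy_times)
print(gaps.most_common())
```

A gap distribution dominated by one value (here 300 seconds) with a few much larger gaps is the signature of a scheduled sender with pauses, in contrast to the irregular gaps of a human user.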
The ID 839736 behaves very differently. Messages are sent throughout the whole day, in almost every second, and each is sent to only a small group of people. In most cases there are fewer than five receivers per second. In total this ID sends fewer messages to more people than the park information system. We assume that this is something like a support team that guests can ask individual questions. One remarkable anomaly occurs on Sunday at 12 o’clock, when a very high amount of communication is recorded, probably reports of the crime from the guests.
MC2.2 – Describe up to 10 communications patterns in the data. Characterize who is communicating, with whom, when and where. If you have more than 10 patterns to report, please prioritize those patterns that are most likely to relate to the crime.
Limit your response to no more than 10 images and 1000 words.
To characterize communication patterns, we focus on the number of sent and received messages and, for each user, on the number of peers that messages are sent to and received from. We excluded all communication to the personal assistance service (839736), the park information system (1278894), and the external recipients. From this data we classified the users into characteristic groups. For this characterization we used plots of the number of messages vs. the number of peers, for both sent and received messages. We only consider people who communicate with the app during the weekend. The rules below are checked in order, and each guest is assigned to the first rule that fits.
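The four per-user quantities used for this characterization can be computed as in the following minimal sketch. The `(sender, recipient)` record layout and the label `"external"` for the external recipient are assumptions; the sketch also drops messages in both directions for the excluded IDs, which may differ in detail from the filtering applied for the plots.

```python
from collections import defaultdict

ASSISTANCE_ID = "839736"   # personal assistance service (ID from the text)
PARK_INFO_ID = "1278894"   # park information system (ID from the text)
EXTERNAL_ID = "external"   # assumed label for the external recipient
EXCLUDED = {ASSISTANCE_ID, PARK_INFO_ID, EXTERNAL_ID}

def user_features(messages):
    """Map user -> (sent, send_peers, received, recv_peers), skipping excluded IDs."""
    feats = defaultdict(lambda: {"sent": 0, "received": 0,
                                 "send_peers": set(), "recv_peers": set()})
    for sender, recipient in messages:
        if sender in EXCLUDED or recipient in EXCLUDED:
            continue
        feats[sender]["sent"] += 1
        feats[sender]["send_peers"].add(recipient)
        feats[recipient]["received"] += 1
        feats[recipient]["recv_peers"].add(sender)
    return {
        user: (f["sent"], len(f["send_peers"]), f["received"], len(f["recv_peers"]))
        for user, f in feats.items()
    }

# Toy data, invented for illustration only.
toy_messages = [
    ("1", "2"), ("1", "2"), ("1", "3"), ("2", "1"),
    ("1278894", "1"),   # broadcast from the park information system, skipped
    ("3", "external"),  # message to the external recipient, skipped
]
features = user_features(toy_messages)
print(features)
```

Plotting `sent` against `send_peers` and `received` against `recv_peers` for every user gives the two scatter plots on which the grouping is based.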
First we group all people who send messages to at most 25 other people (blue, [1]). This group contains the standard guests, who send messages only to a small peer group. 6516 people fit this pattern.
Next we focus on the group marked in green [2]. These people have a lot of peers but send at most 3 messages per peer. Maybe they mingle with other people and just exchange contact information to stay in touch after this weekend. 1122 people fit this pattern.
The next anomaly is the small group marked in red [3] in the receiving plot. This is a closed group of 37 people, each receiving at least 20 messages from every peer. The group represents exactly one peer group, so its members communicate only with the other members of the group.
The people colored in cyan [4] are the next visible group. They send messages to about 20 peers, and to at most 25. With 1155 guests fitting this pattern, it is also a large group.
The remaining people are not easy to distinguish, but a significant number of them receive fewer than 400 messages (magenta, [5]). These are 281 people.
To distinguish the not yet classified people we used a histogram of sent messages. We fit two normal distributions and obtain a possible split threshold at approximately 1500 sent messages. So we split the remaining guests into two groups: one with at most 1500 messages sent (yellow, [6]), containing 1958 guests, and the extremely high-frequency group (black, [7]) with more than 1500 messages sent.
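A split threshold of this kind can be obtained by fitting a two-component normal mixture and taking the point between the two means where both weighted densities are equal. The sketch below uses plain expectation-maximization as one possible fitting method; that choice, the function names, and the synthetic data are assumptions, and the toy numbers are not meant to reproduce the threshold of 1500.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def fit_two_normals(values, iterations=100):
    """Fit a two-component 1-D Gaussian mixture with plain EM.

    Returns [(weight, mean, sigma), (weight, mean, sigma)], lower mean first.
    """
    data = sorted(values)
    n = len(data)
    # Deterministic start: first and third quartile as initial means.
    mus = [data[n // 4], data[(3 * n) // 4]]
    spread = (data[-1] - data[0]) / 4.0 or 1.0
    sigmas = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of component 0 for every point.
        resp = []
        for x in data:
            p0 = weights[0] * normal_pdf(x, mus[0], sigmas[0])
            p1 = weights[1] * normal_pdf(x, mus[1], sigmas[1])
            total = p0 + p1
            resp.append(p0 / total if total > 0 else 0.5)
        # M-step: re-estimate weight, mean and sigma of both components.
        for k in (0, 1):
            r = resp if k == 0 else [1.0 - v for v in resp]
            mass = sum(r)
            if mass < 1e-9:
                continue
            weights[k] = mass / n
            mus[k] = sum(ri * x for ri, x in zip(r, data)) / mass
            var = sum(ri * (x - mus[k]) ** 2 for ri, x in zip(r, data)) / mass
            sigmas[k] = max(math.sqrt(var), 1e-6)
    return sorted(zip(weights, mus, sigmas), key=lambda c: c[1])

def split_threshold(components, steps=60):
    """Bisect between the two means for the point of equal weighted density.

    Assumes each component dominates at its own mean.
    """
    (w0, m0, s0), (w1, m1, s1) = components
    lo, hi = m0, m1
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if w0 * normal_pdf(mid, m0, s0) > w1 * normal_pdf(mid, m1, s1):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Synthetic "messages sent" values, invented for illustration only.
rng = random.Random(0)
toy_sent = ([rng.gauss(500, 150) for _ in range(300)]
            + [rng.gauss(2500, 300) for _ in range(300)])
components = fit_two_normals(toy_sent)
threshold = split_threshold(components)
print(components, threshold)
```

Guests with a value at or below `threshold` would fall into the lower group and all others into the high-frequency group.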
Our clustering shows seven communication patterns. In addition we have the two different patterns of the park information system and the personal assistance service described in MC2.1. There are also 1944 users who did not communicate at all, which could be seen as a pattern, too.
In the figure above, we show the distribution of sent and received messages per cluster. Clusters 3, 6, and 7 are the most active clusters, while in Cluster 1 most people did not send any messages at all. In Cluster 3 almost all people receive the same number of messages, while the number sent differs between guests. This indicates that messages are usually sent to all peers in the group.
When we look at the number of peers, Clusters 2, 6, and 7 are the most active ones. Cluster 2 also has a wide range of peers per user. Clusters 3, 4, and 5 have a similar number of peers, which looks like normal social behavior for small groups like school classes. Cluster 3 has no variance, because it is a strictly closed group.
Next we investigate the ratio of the number of messages sent to the sum of messages sent and received. Clusters 2, 3, 4, 6, and 7 have similar distributions, with guests who send more and guests who send less in roughly equal measure, and the average is almost 0.5 for these clusters. In contrast, Cluster 1 contains a lot of guests who send fewer messages than they receive. Cluster 5 is a very active cluster: its people send more messages than they receive.
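The ratio is sent / (sent + received) per user, so 0.5 means balanced traffic, values below 0.5 mean mostly receiving, and values above 0.5 mean mostly sending. A minimal sketch follows; the cluster labels and numbers are invented for illustration, and users without any traffic are skipped because the ratio is undefined for them.

```python
from statistics import mean

def send_ratio(sent, received):
    """Share of a user's traffic that is outgoing; None if there is no traffic."""
    total = sent + received
    return sent / total if total else None

def mean_ratio_per_cluster(users):
    """`users` maps a cluster label to a list of (sent, received) pairs."""
    result = {}
    for cluster, pairs in users.items():
        ratios = [r for r in (send_ratio(s, rcv) for s, rcv in pairs)
                  if r is not None]
        result[cluster] = mean(ratios) if ratios else None
    return result

# Toy clusters, invented for illustration only.
toy_clusters = {
    "balanced": [(10, 10), (30, 10), (10, 30)],
    "mostly_receiving": [(0, 20), (5, 15)],
    "silent": [(0, 0)],
}
ratios = mean_ratio_per_cluster(toy_clusters)
print(ratios)
```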
Our last point of interest is the communication with the special receiver “extern”. These messages could be some kind of “share with friends” feature, e.g. for pictures from roller coasters. The graphic shows which fraction of people send messages to extern. Guests in Clusters 1, 3, and 5 did not send any messages to extern. In Clusters 2 and 4 some people, one or two out of a thousand, send messages to extern. Most of the external communication comes from Clusters 6 and 7, where about six out of a thousand people send messages to extern. There are no messages that have extern as sender.
MC2.3 – From this data, can you hypothesize when the crime was discovered? Describe your rationale.
Limit your response to no more than 3 images and 300 words.
First of all, we suppose that anomalies in the park prompt guests to write messages, either to friends to talk about what happened or to the personal assistance service to report issues.
So we checked the number of sent messages, separated by day and land. Peaks in the histograms mean that there is a higher communication frequency during this time. There is a high peak in Coaster Alley on Friday at 16 o’clock. On Saturday and Sunday there are similar peaks in Coaster Alley, but the most significant peak is on Sunday from 11:30 to 12:30 in Wet Land, with a corresponding peak at 12 o’clock in the Entry Corridor. We should mention that the personal assistance service is located in the Entry Corridor. As the crime involves a stolen Olympic medal that was supposed to be exhibited in the Creighton Pavilion in Wet Land, this peak seems to be related to the crime.
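Histograms of this kind amount to counting messages per day, land, and time bin and then looking for the largest bins. The sketch below assumes `(day, seconds_since_midnight, area)` records and a 30-minute bin width; both are illustrative choices, and the toy records are invented.

```python
from collections import Counter

def message_histogram(messages, bin_seconds=1800):
    """Count messages per (day, area, time bin).

    `messages` holds (day, seconds_since_midnight, area) tuples; the layout
    and the 30-minute bin width are assumptions made for this sketch.
    """
    counts = Counter()
    for day, second_of_day, area in messages:
        counts[(day, area, second_of_day // bin_seconds)] += 1
    return counts

def bin_label(bin_index, bin_seconds=1800):
    """Start time of a bin as HH:MM."""
    start = bin_index * bin_seconds
    return f"{start // 3600:02d}:{start % 3600 // 60:02d}"

# Toy records, invented for illustration only.
toy_log = (
    [("Sunday", 11 * 3600 + 40 * 60, "Wet Land")] * 3
    + [("Sunday", 12 * 3600 + 5 * 60, "Entry Corridor")] * 2
    + [("Friday", 16 * 3600, "Coaster Alley")]
)
hist = message_histogram(toy_log)
(peak_day, peak_area, peak_bin), peak_count = hist.most_common(1)[0]
print(peak_day, peak_area, bin_label(peak_bin), peak_count)
```

Comparing each bin with the same bin on the other days helps to separate recurring peaks, such as scheduled shows, from one-off events.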
The figures above show an average message distribution over the 60 seconds before the given time. We enriched the message data with the last known position of the sender to locate each message more accurately. As you can see, at 11:30 a high number of messages come from the exhibition hall. Maybe other guests noticed the missing medal and reported this to friends and the park administration.
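Attaching the sender's last known position to a message is a lookup of the most recent movement record at or before the message time. The following is a minimal sketch using binary search; the `(timestamp, user, x, y)` and `(timestamp, sender)` layouts, the function names, and the toy data are assumptions, not the format of the challenge files.

```python
from bisect import bisect_right
from collections import Counter, defaultdict

def build_tracks(movements):
    """Group (timestamp, user, x, y) records into time-sorted per-user tracks."""
    tracks = defaultdict(list)
    for timestamp, user, x, y in sorted(movements):
        tracks[user].append((timestamp, (x, y)))
    return tracks

def last_position(tracks, user, timestamp):
    """Most recent position of `user` at or before `timestamp`, else None."""
    track = tracks.get(user, [])
    times = [t for t, _ in track]
    i = bisect_right(times, timestamp)
    return track[i - 1][1] if i else None

def window_counts(messages, tracks, end_time, window=60):
    """Messages per sender position in the window (end_time - window, end_time]."""
    counts = Counter()
    for timestamp, sender in messages:
        if end_time - window < timestamp <= end_time:
            pos = last_position(tracks, sender, timestamp)
            if pos is not None:
                counts[pos] += 1
    return counts

# Toy data, invented for illustration only (timestamps in seconds).
toy_moves = [(0, "u1", 10, 20), (50, "u1", 11, 20), (0, "u2", 5, 5)]
toy_msgs = [(30, "u1"), (55, "u1"), (58, "u2"), (200, "u1")]
toy_tracks = build_tracks(toy_moves)
counts = window_counts(toy_msgs, toy_tracks, end_time=60)
print(counts)
```

Dividing each count by the window length gives an average message rate per position, which can then be drawn on the park map for a given time.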
Later, at 12 o’clock, there is hectic communication all over the park. Maybe the park administration informed all guests about the crime, or maybe the police arrived and gave instructions via the personal assistance service.